Nusa Tenggara Islands
Enhancing Poverty Targeting with Spatial Machine Learning: An application to Indonesia
Martinez, Rolando Gonzales, Cooray, Mariza
This study leverages spatial machine learning (SML) to enhance the accuracy of Proxy Means Testing (PMT) for poverty targeting in Indonesia. Conventional PMT methodologies are prone to exclusion and inclusion errors due to their inability to account for spatial dependencies and regional heterogeneity. By integrating spatial contiguity matrices, SML models mitigate these limitations, facilitating a more precise identification and comparison of geographical poverty clusters. Utilizing household survey data from the Social Welfare Integrated Data Survey (DTKS) for the periods 2016 to 2020 and 2016 to 2021, this study examines spatial patterns in income distribution and delineates poverty clusters at both provincial and district levels. Empirical findings indicate that the proposed SML approach reduces exclusion errors from 28% to 20% compared to standard machine learning models, underscoring the critical role of spatial analysis in refining machine learning-based poverty targeting. These results highlight the potential of SML to inform the design of more equitable and effective social protection policies, particularly in geographically diverse contexts. Future research can explore the applicability of spatiotemporal models and assess the generalizability of SML approaches across varying socio-economic settings.
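The two ingredients named in this abstract, a spatial contiguity matrix and targeting errors, can be illustrated with a minimal sketch. It is not the authors' method: the region names, income values, and function names are hypothetical. It also assumes common definitions of the two errors (exclusion error as the share of poor households that are not targeted, inclusion error as the share of targeted households that are not poor), which the abstract itself does not spell out.

```python
def spatial_lag(values, neighbors):
    """Row-standardised spatial lag: the mean value over each region's neighbours.

    `neighbors` is a contiguity structure given as an adjacency list
    (hypothetical toy regions, not DTKS data).
    """
    lag = {}
    for region, nbrs in neighbors.items():
        lag[region] = sum(values[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
    return lag


def targeting_errors(is_poor, is_targeted):
    """Return (exclusion_error, inclusion_error) under the assumed definitions."""
    poor = sum(is_poor)
    targeted = sum(is_targeted)
    missed = sum(p and not t for p, t in zip(is_poor, is_targeted))
    leaked = sum(t and not p for p, t in zip(is_poor, is_targeted))
    exclusion = missed / poor if poor else 0.0
    inclusion = leaked / targeted if targeted else 0.0
    return exclusion, inclusion


# Toy contiguity: A borders B, B borders A and C, C borders B.
toy_neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
toy_income = {"A": 10.0, "B": 20.0, "C": 40.0}

# Toy households: four poor, of which three are targeted; one non-poor is targeted.
toy_poor = [True, True, True, True, False, False]
toy_targeted = [True, True, True, False, True, False]

if __name__ == "__main__":
    print(spatial_lag(toy_income, toy_neighbors))
    print(targeting_errors(toy_poor, toy_targeted))
```

In a spatial model, a lagged value of this kind would typically be added as an extra feature alongside the household-level PMT variables, so that predictions can draw on conditions in neighbouring regions.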
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
Adilazuarda, Muhammad Farid, Wijanarko, Musa Izzanardi, Susanto, Lucky, Nur'aini, Khumaisa, Wijaya, Derry, Aji, Alham Fikri
Indonesia is rich in languages and scripts. However, most NLP progress has been made using romanized text. In this paper, we present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NusaAksara covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Although unsupported by Unicode, the Lampung script is included in this dataset. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID, and show that most NLP technologies cannot handle Indonesia's local scripts, with many achieving near-zero performance.
Learning-based estimation of cattle weight gain and its influencing factors
Hossain, Muhammad Riaz Hasib, Islam, Rafiqul, McGrath, Shawn R., Islam, Md Zahidul, Lamb, David
Many cattle farmers still depend on manual methods to measure the live weight gain of cattle at set intervals, which is time consuming, labour intensive, and stressful for both the animals and handlers. A remote and autonomous monitoring system using machine learning (ML) or deep learning (DL) can provide a more efficient and less invasive method, along with predictive capabilities for future cattle weight gain (CWG). Such a system allows continuous monitoring and estimation of individual cattle live weight gain, growth rates and weight fluctuations, considering factors such as environmental conditions, genetic predispositions, feed availability, movement patterns and behaviour. Several researchers have explored the efficiency of estimating CWG using ML and DL algorithms. However, estimating CWG suffers from a lack of consistency in its application. Moreover, ML or DL can provide weight gain estimations based on several features that vary in existing research. Additionally, previous studies have encountered various data-related challenges when estimating CWG. This paper presents a comprehensive investigation of CWG estimation using advanced ML techniques, based on research articles published between 2004 and 2024. This study investigates the current tools, methods, and features used in CWG estimation, as well as their strengths and weaknesses. The findings highlight the significance of using advanced ML approaches in CWG estimation and of the factors that critically influence it. Furthermore, this study identifies potential research gaps and provides research directions on CWG prediction, serving as a reference for future research in this area.
LLM for Everyone: Representing the Underrepresented in Large Language Models
Natural language processing (NLP) has witnessed a profound impact of large language models (LLMs) that excel in a multitude of tasks. However, the limitation of LLMs in multilingual settings, particularly in underrepresented languages, remains a significant hurdle. This thesis aims to bridge the gap in NLP research and development by focusing on underrepresented languages. A comprehensive evaluation of LLMs is conducted to assess their capabilities in these languages, revealing the challenges of multilingual and multicultural generalization. Addressing the multilingual generalization gap, this thesis proposes data-and-compute-efficient methods to mitigate the disparity in LLM ability in underrepresented languages, allowing better generalization on underrepresented languages without the loss of task generalization ability. The proposed solutions cover cross-lingual continual instruction tuning, retrieval-based cross-lingual in-context learning, and in-context query alignment. Furthermore, a novel method to measure cultural values alignment between LLMs operating in different languages is proposed, ensuring cultural sensitivity and inclusivity. These contributions aim to enhance the multilingual and multicultural alignment of LLMs in underrepresented languages, ultimately advancing the NLP field toward greater equality and inclusiveness.
IndoCulture: Exploring Geographically-Influenced Cultural Commonsense Reasoning Across Eleven Indonesian Provinces
Koto, Fajri, Mahendra, Rahmad, Aisyah, Nurul, Baldwin, Timothy
Although commonsense reasoning is greatly shaped by cultural and geographical factors, previous studies on language models have predominantly centered on English cultures, potentially resulting in an Anglocentric bias. In this paper, we introduce IndoCulture, aimed at understanding the influence of geographical factors on language model reasoning ability, with a specific emphasis on the diverse cultures found within eleven Indonesian provinces. In contrast to prior works that relied on templates (Yin et al., 2022) and online scraping (Fung et al., 2024), we created IndoCulture by asking local people to manually develop the context and plausible options based on predefined topics. Evaluations of 23 language models reveal several insights: (1) even the best open-source model struggles with an accuracy of 53.2%, (2) models often provide more accurate predictions for specific provinces, such as Bali and West Java, and (3) the inclusion of location contexts enhances performance, especially in larger models like GPT-4, emphasizing the significance of geographical context in commonsense reasoning.
Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages
Lopo, Joanito Agili, Tanone, Radius
In Indonesia, local languages play an integral role in the culture. However, the available Indonesian language resources still fall into the category of limited data in the Natural Language Processing (NLP) field. This becomes problematic when building NLP models for these languages. To address this gap, we introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages. Our goal is to enhance access to and utilization of these resources, extending their reach within the country. We explain in detail the dataset collection process and the associated challenges. Additionally, we experimented with a translation task using IBM Model 1, chosen because of data constraints. The results show promising performance for each language, indicating potential for further development. Challenges such as lexical variation, smoothing effects, and cross-linguistic variability are discussed. We intend to evaluate the corpus using advanced NLP techniques for low-resource languages, paving the way for multilingual translation models.
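For readers unfamiliar with IBM Model 1, the following is a minimal sketch of its expectation-maximization training loop. It is not the authors' implementation: the function name and the three toy sentence pairs are hypothetical and are not drawn from Bhinneka Korpus, and the NULL source word used in the full model is omitted for brevity.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities t[(tgt, src)] = P(tgt | src).

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    Simplification (assumption): no NULL source word is added.
    """
    tgt_vocab = {tw for _, tgt in pairs for tw in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (tgt, src) pairs
        total = defaultdict(float)  # expected counts per source word

        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac

        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(
            float, {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
        )
    return t


# Toy Indonesian-English pairs, invented purely for illustration.
toy_pairs = [
    (["rumah", "ini"], ["this", "house"]),
    (["buku", "ini"], ["this", "book"]),
    (["buku", "itu"], ["that", "book"]),
]

if __name__ == "__main__":
    table = train_ibm_model1(toy_pairs, iterations=20)
    for (tw, sw), p in sorted(table.items(), key=lambda kv: -kv[1]):
        print(f"P({tw} | {sw}) = {p:.3f}")
```

Because the model uses only word co-occurrence across sentence pairs and ignores word order, it can be trained on very small parallel corpora, which is consistent with the data constraints mentioned in the abstract.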
BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models
Leong, Wei Qi, Ngui, Jian Gang, Susanto, Yosephine, Rengarajan, Hamsawardhini, Sarveswaran, Kengatharaiyer, Tjhi, William Chandra
The rapid development of Large Language Models (LLMs) and the emergence of novel abilities with scale have necessitated the construction of holistic, diverse and challenging benchmarks such as HELM and BIG-bench. However, at the moment, most of these benchmarks focus only on performance in English and evaluations that include Southeast Asian (SEA) languages are few in number. We therefore propose BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages. It comprises three components: (1) an NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR), (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. For this preliminary effort, we implement the NLP benchmark only for Indonesian, Vietnamese, Thai and Tamil, and we only include Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at the moment, we use it as a yardstick to gauge the capabilities of LLMs in the context of SEA languages. Our initial experiments on GPT-4 with BHASA find it lacking in various aspects of linguistic capabilities, cultural representation and sensitivity in the targeted SEA languages. BHASA is a work in progress and will continue to be improved and expanded in the future.
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Cahyawijaya, Samuel, Lovenia, Holy, Aji, Alham Fikri, Winata, Genta Indra, Wilie, Bryan, Mahendra, Rahmad, Wibisono, Christian, Romadhony, Ade, Vincentio, Karissa, Koto, Fajri, Santoso, Jennifer, Moeljadi, David, Wirawan, Cahya, Hudi, Frederikus, Parmonangan, Ivan Halim, Alfina, Ika, Wicaksono, Muhammad Satrio, Putra, Ilham Firdausi, Rahmadani, Samsul, Oenang, Yulianti, Septiandri, Ali Akbar, Jaya, James, Dhole, Kaustubh D., Suryani, Arie Ardiyanti, Putri, Rifki Afina, Su, Dan, Stevens, Keith, Nityasya, Made Nindyatama, Adilazuarda, Muhammad Farid, Ignatius, Ryan, Diandaru, Ryandito, Yu, Tiezheng, Ghifari, Vito, Dai, Wenliang, Xu, Yan, Damapuspita, Dyah, Tho, Cuk, Karo, Ichwanul Muslim Karo, Fatyanosa, Tirana Noor, Ji, Ziwei, Fung, Pascale, Neubig, Graham, Baldwin, Timothy, Ruder, Sebastian, Sujaini, Herry, Sakti, Sakriani, Purwarianti, Ayu
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
Satellite Monitoring of Terrestrial Plastic Waste
Kruse, Caleb, Boyda, Edward, Chen, Sully, Karra, Krishna, Bou-Nahra, Tristan, Hammer, Dan, Mathis, Jennifer, Maddalene, Taylor, Jambeck, Jenna, Laurier, Fabien
Plastic waste is a significant environmental pollutant that is difficult to monitor. We created a system of neural networks to analyze spectral, spatial, and temporal components of Sentinel-2 satellite data to identify terrestrial aggregations of waste. The system works at continental scale. We evaluated performance in Indonesia and detected 374 waste aggregations, more than double the number of sites found in public databases. The same system deployed across twelve countries in Southeast Asia identifies 996 subsequently confirmed waste sites. For each detected site, we algorithmically monitor waste site footprints through time and cross-reference other datasets to generate physical and social metadata. 19% of detected waste sites are located within 200 m of a waterway. Numerous sites sit directly on riverbanks, with high risk of ocean leakage.
New facial recognition technology caught 'imposter' using someone else's passport, US officials say
A new facial recognition technology caught a man trying to enter the US using a passport belonging to someone else, US officials say. Officials with US Customs and Border Protection (CBP) and its Office of Field Operations (OFO) intercepted a 26-year-old man at Washington's Dulles International Airport. The man, whom the agencies referred to as an "imposter", reportedly attempted to use a French passport belonging to someone else. He was travelling to the US from Brazil. "The officer utilised CBP's new facial comparison biometric technology which confirmed the man was not a match to the passport he presented," the CBP press release read. It added: "A search revealed the man's authentic Republic of Congo identification card concealed in his shoe."